39 research outputs found

    Optimization of high-throughput real-time processes in physics reconstruction

    Get PDF
    This thesis was developed in collaboration between Universidad de Sevilla and the European Organization for Nuclear Research, CERN. The LHCb detector is one of the four large detectors situated along the Large Hadron Collider, LHC. In LHCb, particles are collided at high energies in order to understand the difference between matter and antimatter. Due to the massive quantity of data generated by the detector, it is necessary to filter the data in real time, based on the current knowledge encoded in the Standard Model of particle physics. The filter, also known as the High Level Trigger, must process a throughput of 40 Tb/s of data and perform a selection of approximately 1000:1, reducing the throughput to roughly 40 Gb/s of output, which is stored for later analysis. The High Level Trigger is subdivided into two stages: High Level Trigger 1 (HLT1) and High Level Trigger 2 (HLT2). HLT1 runs in real time and yields a data reduction of approximately 30:1. It consists of a series of software processes that reconstruct what happened in each particle collision. The HLT1 reconstruction analyzes only the trajectories of the particles produced in the collision, a problem known as track reconstruction, in order to decide whether the collision data are kept or discarded. In contrast, HLT2 is a finer-grained process, which requires more time to execute and reconstructs all the subdetectors composing LHCb.

    Towards 2020, the LHCb detector and all the components of the data acquisition system will be upgraded in line with the latest technical developments. As part of the data acquisition system, the servers that process HLT1 and HLT2 will also be upgraded. At the same time, the LHC accelerator will be upgraded, increasing the amount of data generated in every bunch crossing by roughly 5 times. Due to the accelerator and detector upgrades, the amount of data that the HLT as a whole must process is expected to grow by a factor of 40. Projections of the current software's scalability to 2020 underestimated the resources required to cope with this increase in throughput. As a consequence, studies of all the algorithms composing HLT1 and HLT2 were carried out, together with a modernization of the code, in order to improve performance and be able to process the expected amount of data.

    In this thesis, several algorithms of the LHCb reconstruction are explored. The track reconstruction problem is analyzed in depth and new algorithms are proposed to solve it. Since the analyzed problems are massively parallel, these algorithms are implemented in languages specialized for modern graphics cards (GPUs), given their inherently parallel architecture. Two track reconstruction algorithms are designed in this work. In addition, four decoding algorithms and a clustering algorithm are designed and implemented, problems also found in HLT1. Furthermore, a parallel Kalman filter algorithm is designed and implemented, which can be used in both HLT stages. The developed algorithms satisfy the requirements set by the LHCb collaboration for the 2020 upgrade.

    In order to execute the algorithms efficiently on graphics cards, a software framework specialized for GPUs is developed, which allows executing reconstruction sequences on GPUs in parallel. Combining the developed algorithms with the framework, an execution sequence is completed that lays the foundations of an HLT1 executable on GPUs. During the research carried out in this thesis, the aforementioned developments, together with a small team of collaborators coordinated by the author, led to the completion of a full HLT1 sequence executable on GPUs. The performance obtained on GPUs, a product of this thesis, makes it possible to execute a reconstruction sequence in real time under the upgraded LHCb conditions foreseen for 2020. This constitutes the first High Level Trigger executed entirely on GPUs completed for any LHC experiment. Finally, several possible configurations for including graphics cards in the LHCb data acquisition system are detailed.
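
As a rough illustration of the framework idea described in the abstract, the sketch below expresses a reconstruction sequence as an ordered list of algorithms run over batches of events. All names and types here (EventBatch, Algorithm, run_sequence) are hypothetical; the actual framework keeps intermediate data resident in GPU memory and schedules GPU kernels rather than host functions.

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical event batch handed from one reconstruction step to the next;
// the real framework keeps such data resident on the GPU.
struct EventBatch { /* decoded hits, tracks, vertices, ... */ };

// A reconstruction sequence as an ordered list of named algorithms, in the
// spirit of the HLT1 chain described above (decoding, clustering, track
// reconstruction, Kalman filter). Names and types are illustrative only.
struct Algorithm {
    std::string name;
    std::function<void(EventBatch&)> run;
};

void run_sequence(const std::vector<Algorithm>& sequence, EventBatch& batch) {
    // Each algorithm consumes the output of the previous one; events are
    // processed in batches to expose massive parallelism.
    for (const auto& alg : sequence) alg.run(batch);
}
```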

    An efficient low-rank Kalman filter for modern SIMD architectures

    No full text
    The Kalman filter is a fundamental process in the reconstruction of particle collisions in high-energy physics detectors. At the LHCb detector in the Large Hadron Collider, this reconstruction happens at an average rate of 30 million times per second. Due to iterative enhancements in the detector's technology, together with the projected removal of the hardware filter, the rate of particles that will need to be processed in software in real time is expected to increase in the coming years by a factor of 40. In order to cope with the projected data rate, processing and filtering software must be adapted to take advantage of cutting-edge hardware technologies. We present Cross Kalman, a cross-architecture Kalman filter optimized for low-rank problems and SIMD architectures. We explore multi- and many-core architectures and compare their performance in single and double precision configurations. We show that, under the constraints of our mathematical formulation, we saturate the architectures under study. We validate our results and integrate our filter into the LHCb framework. Our work allows better use of the available resources at the LHCb experiment and enables us to evaluate other computing platforms for future hardware upgrades. Finally, we expect that the presented algorithm and data structures can be easily adapted to other applications of low-rank Kalman filters.
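
As a minimal sketch of the structure-of-arrays style such a filter relies on, the code below applies one predict/update step of a scalar Kalman filter across a whole batch of tracks, so that the same arithmetic maps naturally onto SIMD lanes. It is an illustration only: the actual Cross Kalman filter operates on higher-dimensional (though still low-rank) states and tuned memory layouts, and all names here are hypothetical.

```cpp
#include <cstddef>

// Hypothetical structure-of-arrays layout: one scalar state and covariance
// per track, so the same filter step can be applied to many tracks at once.
struct TrackStates {
    float* x;       // state estimates, one per track
    float* P;       // covariances, one per track
    std::size_t n;  // number of tracks in the batch
};

// One predict + update step of a scalar Kalman filter across all tracks.
// F, Q, H and R are assumed constant over the batch; z holds one
// measurement per track. The iterations are independent, so the loop is
// amenable to compiler auto-vectorisation.
void kalman_step(TrackStates& s, const float* z,
                 float F, float Q, float H, float R) {
    for (std::size_t i = 0; i < s.n; ++i) {
        // Predict
        float x = F * s.x[i];
        float P = F * s.P[i] * F + Q;
        // Update
        float K = P * H / (H * P * H + R);  // Kalman gain
        s.x[i] = x + K * (z[i] - H * x);
        s.P[i] = (1.0f - K * H) * P;
    }
}
```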

    SIMD studies in the LHCb reconstruction software

    No full text
    During the data taking process in the LHC at CERN, millions of collisions are recorded every second by the LHCb Detector. The LHCb Online computing farm, counting around 15000 cores, is dedicated to the reconstruction of the events in real time, in order to filter those with interesting physics. The ones kept are later analysed offline in a more precise fashion on the Grid. This imposes very stringent requirements on the reconstruction software, which has to be as efficient as possible. Modern CPUs support so-called vector extensions, which extend their instruction sets, allowing for concurrent execution across functional units. Several libraries expose the Single Instruction Multiple Data programming paradigm to issue these instructions. The use of vectorisation in our codebase can provide performance boosts, ultimately leading to physics reconstruction enhancements. In this paper, we present vectorisation studies of significant reconstruction algorithms. A variety of vectorisation libraries are analysed and compared in terms of design, maintainability and performance. We also present the steps taken to systematically measure the performance of the released software, to ensure the consistency of the run-time of the vectorised software.
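
The following is a small sketch of the kind of explicitly vectorised loop such libraries enable, written against std::experimental::simd as one portable way to issue SIMD instructions (the specific libraries compared in the paper are not reproduced here). It assumes a compiler that ships <experimental/simd> and, for brevity, that n is a multiple of the vector width.

```cpp
#include <experimental/simd>
#include <cstddef>

namespace stdx = std::experimental;

// Scalar reference: y[i] += a * x[i]
void axpy_scalar(float a, const float* x, float* y, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) y[i] += a * x[i];
}

// Explicitly vectorised version: each iteration loads, updates and stores
// one full SIMD register's worth of elements.
void axpy_simd(float a, const float* x, float* y, std::size_t n) {
    using vf = stdx::native_simd<float>;
    for (std::size_t i = 0; i < n; i += vf::size()) {
        vf vx(x + i, stdx::element_aligned);  // load a vector of x
        vf vy(y + i, stdx::element_aligned);  // load a vector of y
        vy += a * vx;                         // a is broadcast across lanes
        vy.copy_to(y + i, stdx::element_aligned);
    }
}
```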

    A High-Throughput Kalman Filter for Modern SIMD Architectures

    No full text
    The Kalman filter is a critical component of the reconstruction process of subatomic particle collisions in high-energy physics detectors. At the LHCb detector in the Large Hadron Collider this reconstruction must be performed at an average rate of 30 million times per second. As a consequence of the ever-increasing collision rate and upcoming detector upgrades, the data rate that needs to be processed in real time is expected to increase by a factor of 40 in the next five years. In order to keep pace, processing and filtering software must take advantage of the latest developments in hardware technology.

    High-speed zero-copy data transfer for DAQ applications

    No full text
    The LHCb Data Acquisition (DAQ) will be upgraded in 2020 to a trigger-free readout. In order to achieve this goal we will need to connect around 500 nodes with a total network capacity of 32 Tb/s. To reach such a high network capacity we are testing zero-copy technology, which maximizes the usable link throughput without adding excessive CPU and memory bandwidth overhead, leaving resources free for data processing and thereby reducing the power, space and cost required for the same result. We developed a modular test application which can be used with different transport layers. For the zero-copy implementation we chose the OFED IBVerbs API, because it provides low-level access and high throughput. We present throughput and CPU usage measurements of 40 GbE solutions using Remote Direct Memory Access (RDMA), for several network configurations, to test the scalability of the system.
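
As a hedged sketch of what a zero-copy send with the IBVerbs API looks like: the buffer is registered with the network adapter, which then reads the data in place, with no copy into kernel buffers. Device discovery, queue-pair creation and connection establishment are omitted here, and in a real application the (expensive) memory registration would be done once at startup rather than per send.

```cpp
#include <infiniband/verbs.h>
#include <cstddef>
#include <cstdint>

// Post 'buf' for transmission without copying it. 'pd' and 'qp' are assumed
// to be an already-initialised protection domain and connected queue pair.
bool post_zero_copy_send(ibv_pd* pd, ibv_qp* qp, void* buf, std::size_t len) {
    // Register the buffer so the adapter can DMA directly from it. In
    // production this registration is cached and reused across sends.
    ibv_mr* mr = ibv_reg_mr(pd, buf, len, IBV_ACCESS_LOCAL_WRITE);
    if (!mr) return false;

    ibv_sge sge{};
    sge.addr   = reinterpret_cast<std::uint64_t>(buf);
    sge.length = static_cast<std::uint32_t>(len);
    sge.lkey   = mr->lkey;

    ibv_send_wr wr{};
    ibv_send_wr* bad = nullptr;
    wr.wr_id      = 1;
    wr.sg_list    = &sge;
    wr.num_sge    = 1;
    wr.opcode     = IBV_WR_SEND;
    wr.send_flags = IBV_SEND_SIGNALED;  // request a completion notification

    // Hand the descriptor to the adapter; the data itself is never copied.
    return ibv_post_send(qp, &wr, &bad) == 0;
}
```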

    Performance benchmark of LHCb code on state-of-the-art x86 architectures

    No full text
    For Run 2 of the LHC, LHCb is replacing a significant part of its event filter farm with new compute nodes. To evaluate the best performing solution, we have developed a method to convert our high level trigger application into a stand-alone, bootable benchmark image. With additional instrumentation we turned it into a self-optimising benchmark which explores techniques such as late forking, NUMA balancing and the optimal number of threads; i.e., it automatically optimises box-level performance. We have run this procedure on a wide range of Haswell-E CPUs and numerous other architectures from both Intel and AMD, including the latest Intel micro-blade servers. We present results in terms of performance, power consumption, overheads and relative cost.
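
A minimal sketch of the thread-count scan such a self-optimising benchmark might perform is shown below. The synthetic workload stands in for the actual trigger application, and all names are placeholders; the real benchmark additionally explores late forking and NUMA placement.

```cpp
#include <algorithm>
#include <chrono>
#include <cmath>
#include <cstddef>
#include <iostream>
#include <thread>
#include <vector>

volatile double sink;  // prevents the workload from being optimised away

// Stand-in for processing a batch of events in the trigger application.
void process_events(std::size_t n_events) {
    double acc = 0.0;
    for (std::size_t i = 0; i < n_events; ++i)
        acc += std::sqrt(static_cast<double>(i));
    sink = acc;
}

// Measure events per second with a given number of worker threads.
double throughput(unsigned n_threads, std::size_t events_per_thread) {
    auto t0 = std::chrono::steady_clock::now();
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < n_threads; ++t)
        workers.emplace_back(process_events, events_per_thread);
    for (auto& w : workers) w.join();
    std::chrono::duration<double> dt = std::chrono::steady_clock::now() - t0;
    return n_threads * events_per_thread / dt.count();
}

int main() {
    const std::size_t events = 1000000;
    const unsigned hc = std::max(1u, std::thread::hardware_concurrency());
    unsigned best = 1;
    double best_rate = 0.0;
    // Scan thread counts up to twice the hardware concurrency (to probe
    // SMT and oversubscription effects) and keep the best setting.
    for (unsigned t = 1; t <= 2 * hc; t *= 2) {
        double r = throughput(t, events);
        std::cout << t << " threads: " << r << " events/s\n";
        if (r > best_rate) { best_rate = r; best = t; }
    }
    std::cout << "optimal: " << best << " threads\n";
}
```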